Practical - Week 1
2024-09-30
Data that place a particular taxon in a particular location and time can take many forms:
Presence-only (PO) data
| PROS | CONS |
|---|---|
| huge amounts of data available, easily aggregated | often without details of effort/method, wide variation in data quality |
Presence-absence (PA) data
| PROS | CONS |
|---|---|
| absences are informative, area and effort are measured | less abundant (too time consuming), methods are species-specific |
Repeated surveys
| PROS | CONS |
|---|---|
| standardised protocols, multiple points in time | expensive: geographically restricted, usually temporally too |
Range-maps
| PROS | CONS |
|---|---|
| rough estimates of the outer boundaries of areas within which species are likely to occur | large spatial and temporal uncertainties |
Data can also be classified by how they were collected.
Structured
Semi-structured
Unstructured (opportunistic)
Finally, data can also be classified by how they are made available to others.
Disaggregated
Aggregated
GBIF is an international network and data infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.
OBIS is a global open-access data and information clearing-house on marine biodiversity for science, conservation and sustainable development.
eBird’s goal is to gather birdwatchers’ knowledge and experience in the form of checklists of birds, archive it, and freely share it to power new data-driven approaches to science, conservation and education.
iNaturalist is one of the world’s most popular nature apps. It allows participants to contribute observations of any organism, or traces thereof, along with associated spatio-temporal metadata.
IUCN’s (International Union for Conservation of Nature) Red List of Threatened Species has evolved to become the world’s most comprehensive information source on the global extinction risk status of animal, fungus and plant species.
rredlist: https://github.com/ropensci/rredlist
Map of Life endeavors to provide ‘best-possible’ species range information and species lists for any geographic area. The Map of Life assembles and integrates different sources of data describing species distributions worldwide.
Chorological maps for the main European woody species is a data paper with a dataset of chorological maps for the main European tree and shrub species, put together by Giovanni Caudullo, Erik Welk, and Jesús San-Miguel-Ayanz.
BBS (Breeding Bird Survey) involves thousands of volunteer birdwatchers carrying out standardised annual bird counts on randomly-located 1-km sites. It’s part of the NBN Atlas.
BIEN is a network of ecologists, botanists, and computer scientists working together to document global patterns of plant diversity, function and distribution.
SiBBr (Brazilian Biodiversity Information System) is an online platform that integrates data and information about biodiversity and ecosystems from different sources, making them accessible for different uses.
sibbr: https://github.com/sibbr
BioTIME is an open-access global database of assemblage time series for quantifying and understanding biodiversity change.
BioTime Hub: https://github.com/bioTIMEHub
Open means anyone can freely access, use, modify, and share for any purpose.
Public doesn’t mean open
The data on the internet can be public but they are not necessarily open. They can be standard, available in open formats (e.g., csv), and yet, if they don’t have a licence, by default they are closed (all rights reserved).
Darwin Core is the internationally agreed data standard to facilitate the sharing of information about biological diversity.
countryCode: The standard code for the country in which the Location occurs. Recommended best practice is to use an ISO 3166-1-alpha-2 country code.
recordedBy: A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence.
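As a small illustration of these two terms, the sketch below builds a toy record whose column names follow Darwin Core. The values are invented, and the " | " separator is assumed here because it is the separator Darwin Core recommends for concatenated lists.

```r
# A toy occurrence record using Darwin Core term names as column names.
# All values are invented for illustration only.
record <- data.frame(
  scientificName = "Sciurus vulgaris Linnaeus, 1758",
  countryCode    = "CZ",                     # ISO 3166-1-alpha-2 code for Czechia
  recordedBy     = "A. Novak | B. Svoboda",  # two recorders, concatenated
  stringsAsFactors = FALSE
)

# Split the concatenated recordedBy string back into individual names
recorders <- strsplit(record$recordedBy, " | ", fixed = TRUE)[[1]]
recorders
```

Because every publisher uses the same field names, the same code can be applied to records from different datasets.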
Open data are licensed under open licenses. Some examples:
CC0: Public domain
CC-BY: Attribution
CC-BY-NC: Attribution - Non Commercial
CC-BY-SA: Attribution - Share Alike
Data that are standardized and have an open licence can be shared :)
Choose a taxon, choose one data source and try to get distribution data.
Then answer the following 3 questions:
We will use the mammals of the Czech Republic as an example dataset. We will access data through GBIF using tools available in R.
code and data folders inside).
File > New project > New directory or Existing directory
We will always load packages into R using the package pacman.
If you attempt to load a library that is not installed, pacman will try to install it automatically.
We will use tidyverse for the manipulation and transformation of data.
We will be using many functions from this collection of packages, like filter(), mutate(), and later read_csv().
We will use rgbif to download data from GBIF directly into R.
We will need to get a taxon ID (taxonKey) for the Mammalia class from the GBIF backbone. For that we will use another package called taxize.
We will use sf to work with spatial data.
We will use rnaturalearth to interact with Natural Earth to get mapping data into R (e.g., countries’ polygons).
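A minimal sketch of how the packages listed above could be loaded with pacman. It assumes a CRAN mirror is reachable, because pacman installs anything that is missing. The vector name `pkgs` is only illustrative.

```r
# Install pacman itself only if it is not available yet
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")

# Packages used in this practical (taken from the text above)
pkgs <- c("tidyverse", "rgbif", "taxize", "sf", "rnaturalearth")

# p_load() attaches each package and installs it first if it is missing
pacman::p_load(char = pkgs)
```

The same call can also be written with bare names, e.g. `pacman::p_load(tidyverse, rgbif)`.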
Create some variables that will be used later.
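A sketch of what these variables might look like. The names `taxa` and `country_code` are the ones used by the code later in this practical; the values are inferred from the text (class Mammalia, ISO code CZ) and are an assumption rather than the original chunk.

```r
# Name of the taxon we want to look up in the GBIF backbone
taxa <- "Mammalia"

# ISO 3166-1-alpha-2 code of the country of interest (Czechia)
country_code <- "CZ"
```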
Get a taxon ID for the Mammalia class.
taxon_key <- get_gbifid_(taxa) %>%
  bind_rows() %>% # Transform the result of get_gbifid_ into a dataframe
  filter(matchtype == "EXACT" & status == "ACCEPTED") %>% # Filter the dataframe by the columns "matchtype" and "status"
  pull(usagekey) # Pull the contents of the column "usagekey"
taxon_key
[1] 359
Basemap of CZ to use later for plotting or checking the dataset.
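One possible way to build such a basemap, sketched with rnaturalearth and sf. The object name `CZ_borders` is hypothetical. The coarse "small" scale is chosen only to keep the example light, and depending on the installed version the companion package rnaturalearthdata may also be needed.

```r
library(rnaturalearth)
library(sf)

# Country polygons from Natural Earth at the coarsest scale, as an sf object
world <- ne_countries(scale = "small", returnclass = "sf")

# Keep only Czechia, matched on its ISO 3166-1-alpha-2 code
CZ_borders <- world[which(world$iso_a2 == "CZ"), ]

# Quick look at the outline
plot(st_geometry(CZ_borders))
```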
And now we can use the function occ_count() to find out the number of occurrence records for the entire Czech Republic.
How many occurrence records are in GBIF for the entire Czech Republic?
And how many records for the mammals of the Czech Republic?
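Both questions can be answered with occ_count(). The sketch below assumes an internet connection and a recent version of rgbif in which occ_count() accepts the same filters as occ_search(). The names `n_all` and `n_mammals` are illustrative, and the counts will change over time as GBIF grows.

```r
library(rgbif)

country_code <- "CZ"  # ISO code of Czechia
taxon_key <- 359      # GBIF backbone key for Mammalia, obtained earlier

# All occurrence records published for the country
n_all <- occ_count(country = country_code)

# Only the records that belong to Mammalia
n_mammals <- occ_count(country = country_code, taxonKey = taxon_key)

n_all
n_mammals
```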
We are ready to do a download. Whoop!
To do this, we will use occ_search(), which is convenient for small, quick queries. For larger or citable downloads, see occ_download().
occ_search(
taxonKey = NULL,
scientificName = NULL,
country = NULL,
publishingCountry = NULL,
hasCoordinate = NULL,
typeStatus = NULL,
recordNumber = NULL,
lastInterpreted = NULL,
continent = NULL,
geometry = NULL,
geom_big = "asis",
geom_size = 40,
geom_n = 10,
recordedBy = NULL,
recordedByID = NULL,
identifiedByID = NULL,
basisOfRecord = NULL,
datasetKey = NULL,
eventDate = NULL,
catalogNumber = NULL,
year = NULL,
month = NULL,
decimalLatitude = NULL,
decimalLongitude = NULL,
elevation = NULL,
depth = NULL,
institutionCode = NULL,
collectionCode = NULL,
hasGeospatialIssue = NULL,
issue = NULL,
search = NULL,
mediaType = NULL,
subgenusKey = NULL,
repatriated = NULL,
phylumKey = NULL,
kingdomKey = NULL,
classKey = NULL,
orderKey = NULL,
familyKey = NULL,
genusKey = NULL,
establishmentMeans = NULL,
protocol = NULL,
license = NULL,
organismId = NULL,
publishingOrg = NULL,
stateProvince = NULL,
waterBody = NULL,
locality = NULL,
limit = 500,
start = 0,
fields = "all",
return = NULL,
facet = NULL,
facetMincount = NULL,
facetMultiselect = NULL,
skip_validate = TRUE,
curlopts = list(),
...
)
Get occurrence records of mammals from the Czech Republic.
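A minimal sketch of a call that would produce this kind of output, assuming an internet connection. Only the taxon key and the country are set, so every other argument keeps its default, including limit = 500. The name `res` is illustrative.

```r
library(rgbif)

taxon_key <- 359      # GBIF backbone key for Mammalia
country_code <- "CZ"  # ISO code of Czechia

# Search with default settings: at most 500 records are returned
res <- occ_search(taxonKey = taxon_key, country = country_code)

res$meta$count  # number of records found by GBIF
nrow(res$data)  # number of records actually returned
```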
Records found [8097]
Records returned [500]
No. unique hierarchies [36]
No. media records [500]
No. facets [0]
Args [occurrenceStatus=PRESENT, limit=500, offset=0, taxonKey=359, country=CZ,
fields=all]
# A tibble: 500 × 100
key scientificName decimalLatitude decimalLongitude issues datasetKey
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 4518978086 Myocastor coyp… 50.1 14.4 cdc,c… 50c9509d-…
2 4510103035 Sciurus vulgar… 49.8 13.4 cdc,c… 50c9509d-…
3 4510305990 Myocastor coyp… 50.1 14.4 cdc,c… 50c9509d-…
4 4510153353 Myocastor coyp… 50.1 14.4 cdc,c… 50c9509d-…
5 4510362535 Microtus arval… 50.0 16.3 cdc,c… 50c9509d-…
6 4510154668 Castor fiber L… 49.9 14.2 cdc,c… 50c9509d-…
7 4510377266 Capreolus capr… 49.2 17.4 cdc,c… 50c9509d-…
8 4510457308 Myocastor coyp… 50.1 14.4 cdc,c… 50c9509d-…
9 4510279317 Capreolus capr… 49.4 15.7 cdc,c… 50c9509d-…
10 4512107228 Castor fiber L… 49.5 13.3 cdc,c… 50c9509d-…
# ℹ 490 more rows
# ℹ 94 more variables: publishingOrgKey <chr>, installationKey <chr>,
# hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
# lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
# occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
# classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
# speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>, …
By default, occ_search() returns only the first 500 records (limit = 500).
To get all the records we need to specify a larger limit. Since we have over 8,000 records, we’ll choose 9,000 as the limit.
Records found [8097]
Records returned [8097]
No. unique hierarchies [275]
No. media records [8097]
No. facets [0]
Args [occurrenceStatus=PRESENT, limit=9000, offset=0, taxonKey=359, country=CZ,
fields=all]
# A tibble: 8,097 × 190
key scientificName decimalLatitude decimalLongitude issues datasetKey
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 4518978086 Myocastor coyp… 50.1 14.4 cdc,c… 50c9509d-…
2 4510103035 Sciurus vulgar… 49.8 13.4 cdc,c… 50c9509d-…
3 4510305990 Myocastor coyp… 50.1 14.4 cdc,c… 50c9509d-…
4 4510153353 Myocastor coyp… 50.1 14.4 cdc,c… 50c9509d-…
5 4510362535 Microtus arval… 50.0 16.3 cdc,c… 50c9509d-…
6 4510154668 Castor fiber L… 49.9 14.2 cdc,c… 50c9509d-…
7 4510377266 Capreolus capr… 49.2 17.4 cdc,c… 50c9509d-…
8 4510457308 Myocastor coyp… 50.1 14.4 cdc,c… 50c9509d-…
9 4510279317 Capreolus capr… 49.4 15.7 cdc,c… 50c9509d-…
10 4512107228 Castor fiber L… 49.5 13.3 cdc,c… 50c9509d-…
# ℹ 8,087 more rows
# ℹ 184 more variables: publishingOrgKey <chr>, installationKey <chr>,
# hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
# lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
# occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
# classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
# speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>, …
Finally, we store the result in the object mammalsCZ.
mammalsCZ <- occ_search(
  taxonKey = taxon_key, # Key 359 created previously
  country = country_code, # CZ, ISO code of Czechia
  limit = 9000, # Max number of records to download
  hasGeospatialIssue = FALSE # Only records without spatial issues
)
mammalsCZ <- mammalsCZ$data # The output of occ_search is a list with a data object inside. Here we pull the data out of the list.
Mammals occurrence records from the Czech Republic
Rows: 8,045
Columns: 189
$ key <chr> "4518978086", "4510103035", "451030…
$ scientificName <chr> "Myocastor coypus (Molina, 1782)", …
$ decimalLatitude <dbl> 50.08180, 49.75979, 50.08204, 50.08…
$ decimalLongitude <dbl> 14.41210, 13.35779, 14.41030, 14.40…
$ issues <chr> "cdc,cdround", "cdc,cdround", "cdc,…
$ datasetKey <chr> "50c9509d-22c7-4a22-a47d-8c48425ef4…
$ publishingOrgKey <chr> "28eb1a3f-1c15-4a95-931a-4af90ecb57…
$ installationKey <chr> "997448a8-f762-11e1-a439-00145eb45e…
$ hostingOrganizationKey <chr> "28eb1a3f-1c15-4a95-931a-4af90ecb57…
$ publishingCountry <chr> "US", "US", "US", "US", "US", "US",…
$ protocol <chr> "DWC_ARCHIVE", "DWC_ARCHIVE", "DWC_…
$ lastCrawled <chr> "2024-09-20T15:47:23.061+00:00", "2…
$ lastParsed <chr> "2024-09-21T09:35:02.230+00:00", "2…
$ crawlId <int> 486, 486, 486, 486, 486, 486, 486, …
$ basisOfRecord <chr> "HUMAN_OBSERVATION", "HUMAN_OBSERVA…
$ occurrenceStatus <chr> "PRESENT", "PRESENT", "PRESENT", "P…
$ taxonKey <int> 4264680, 8211070, 4264680, 4264680,…
$ kingdomKey <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ phylumKey <int> 44, 44, 44, 44, 44, 44, 44, 44, 44,…
$ classKey <int> 359, 359, 359, 359, 359, 359, 359, …
$ orderKey <int> 1459, 1459, 1459, 1459, 1459, 1459,…
$ familyKey <int> 3240572, 9456, 3240572, 3240572, 32…
$ genusKey <int> 3240573, 2437489, 3240573, 3240573,…
$ speciesKey <int> 4264680, 8211070, 4264680, 4264680,…
$ acceptedTaxonKey <int> 4264680, 8211070, 4264680, 4264680,…
$ acceptedScientificName <chr> "Myocastor coypus (Molina, 1782)", …
$ kingdom <chr> "Animalia", "Animalia", "Animalia",…
$ phylum <chr> "Chordata", "Chordata", "Chordata",…
$ order <chr> "Rodentia", "Rodentia", "Rodentia",…
$ family <chr> "Myocastoridae", "Sciuridae", "Myoc…
$ genus <chr> "Myocastor", "Sciurus", "Myocastor"…
$ species <chr> "Myocastor coypus", "Sciurus vulgar…
$ genericName <chr> "Myocastor", "Sciurus", "Myocastor"…
$ specificEpithet <chr> "coypus", "vulgaris", "coypus", "co…
$ taxonRank <chr> "SPECIES", "SPECIES", "SPECIES", "S…
$ taxonomicStatus <chr> "ACCEPTED", "ACCEPTED", "ACCEPTED",…
$ iucnRedListCategory <chr> "LC", "LC", "LC", "LC", "LC", "LC",…
$ dateIdentified <chr> "2024-01-12T10:43:43", "2024-01-04T…
$ coordinateUncertaintyInMeters <dbl> 102, 2, 9, 3, NA, 3, 15, 4, 19, 3, …
$ continent <chr> "EUROPE", "EUROPE", "EUROPE", "EURO…
$ stateProvince <chr> "Prague", "Plzeňský", "Prague", "Pr…
$ year <int> 2024, 2024, 2024, 2024, 2024, 2024,…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day <int> 1, 2, 4, 5, 6, 6, 7, 8, 8, 10, 11, …
$ eventDate <chr> "2024-01-01T15:09:20", "2024-01-02T…
$ startDayOfYear <int> 1, 2, 4, 5, 6, 6, 7, 8, 8, 10, 11, …
$ endDayOfYear <int> 1, 2, 4, 5, 6, 6, 7, 8, 8, 10, 11, …
$ modified <chr> "2024-03-21T21:33:22.000+00:00", "2…
$ lastInterpreted <chr> "2024-09-21T09:35:02.230+00:00", "2…
$ references <chr> "https://www.inaturalist.org/observ…
$ license <chr> "http://creativecommons.org/license…
$ isSequenced <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
$ identifier <chr> "195466906", "195735620", "19574596…
$ facts <chr> "none", "none", "none", "none", "no…
$ relations <chr> "none", "none", "none", "none", "no…
$ isInCluster <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, F…
$ datasetName <chr> "iNaturalist research-grade observa…
$ recordedBy <chr> "katjawil", "Míša Peterka", "Andrej…
$ identifiedBy <chr> "manumea2000", "Míša Peterka", "And…
$ geodeticDatum <chr> "WGS84", "WGS84", "WGS84", "WGS84",…
$ class <chr> "Mammalia", "Mammalia", "Mammalia",…
$ countryCode <chr> "CZ", "CZ", "CZ", "CZ", "CZ", "CZ",…
$ recordedByIDs <chr> "none", "none", "none", "none", "no…
$ identifiedByIDs <chr> "none", "none", "none", "none", "no…
$ gbifRegion <chr> "EUROPE", "EUROPE", "EUROPE", "EURO…
$ country <chr> "Czechia", "Czechia", "Czechia", "C…
$ publishedByGbifRegion <chr> "NORTH_AMERICA", "NORTH_AMERICA", "…
$ rightsHolder <chr> "katjawil", "Míša Peterka", "Andrej…
$ identifier.1 <chr> "195466906", "195735620", "19574596…
$ http...unknown.org.nick <chr> "katjawil", "peterkam", "andrej_fun…
$ verbatimEventDate <chr> "2024-01-01 15:09:20+01:00", "2024/…
$ collectionCode <chr> "Observations", "Observations", "Ob…
$ verbatimLocality <chr> "Vltava, Prague 1, Prag, CZ", "Plze…
$ gbifID <chr> "4518978086", "4510103035", "451030…
$ occurrenceID <chr> "https://www.inaturalist.org/observ…
$ taxonID <chr> "43997", "46001", "43997", "43997",…
$ catalogNumber <chr> "195466906", "195735620", "19574596…
$ institutionCode <chr> "iNaturalist", "iNaturalist", "iNat…
$ eventTime <chr> "15:09:20+01:00", "13:09:00+01:00",…
$ http...unknown.org.captive <chr> "wild", "wild", "wild", "wild", "wi…
$ identificationID <chr> "442516619", "440401646", "44043095…
$ name <chr> "Myocastor coypus (Molina, 1782)", …
$ recordedByIDs.type <chr> NA, NA, NA, NA, NA, NA, NA, "ORCID"…
$ recordedByIDs.value <chr> NA, NA, NA, NA, NA, NA, NA, "https:…
$ identifiedByIDs.type <chr> NA, NA, NA, NA, NA, NA, NA, "ORCID"…
$ identifiedByIDs.value <chr> NA, NA, NA, NA, NA, NA, NA, "https:…
$ informationWithheld <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identificationRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ occurrenceRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lifeStage <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ sex <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ individualCount <int> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ samplingProtocol <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ habitat <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ vernacularName <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locality <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identificationVerificationStatus <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ eventType <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ infraspecificEpithet <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ distanceFromCentroidInMeters <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ dataGeneralizations <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ datasetID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ language <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ accessRights <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ recordNumber <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.taxonRankID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ dynamicProperties <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxonConceptID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxonRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ eventID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ projectId <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ organismQuantity <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ organismQuantityType <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ otherCatalogNumbers <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ gadm <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ associatedSequences <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ networkKeys <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ coordinatePrecision <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ institutionKey <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ acceptedNameUsage <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locationRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferencedBy <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ collectionKey <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ preparations <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ institutionID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ nomenclaturalCode <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ disposition <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bibliographicCitation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ collectionID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.language <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ footprintWKT <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.modified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ originalNameUsage <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ nameAccordingTo <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ elevation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ elevationAccuracy <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fieldNumber <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ higherGeography <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locationAccordingTo <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferencedDate <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceProtocol <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimCoordinateSystem <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ organismID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ previousIdentifications <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identificationQualifier <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ higherClassification <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceSources <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ownerInstitutionCode <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ materialEntityID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ footprintSRS <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimIdentification <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locationID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.recordID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ county <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rights <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.recordEnteredBy <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceVerificationStatus <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ establishmentMeans <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ parentNameUsage <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ island <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ materialSampleID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ associatedReferences <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ eventRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimElevation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ higherGeographyID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ combinationAuthors <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimScientificName <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.verbatimLabel <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ combinationYear <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ canonicalName <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestEonOrLowestEonothem <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestEonOrHighestEonothem <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestEraOrLowestErathem <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestEraOrHighestErathem <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestPeriodOrLowestSystem <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestPeriodOrHighestSystem <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestEpochOrLowestSeries <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestEpochOrHighestSeries <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ municipality <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestAgeOrLowestStage <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ namePublishedInYear <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lithostratigraphicTerms <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimTaxonRank <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestAgeOrHighestStage <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ formation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bed <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ geologicalContextID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Check the data output. How many rows and columns does it have?
Mammals occurrence records from the Czech Republic
How many records do we have?
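These questions can be answered with a few inspection functions. The sketch uses a tiny invented table, `toy`, as a stand-in for mammalsCZ so that it runs without a download.

```r
library(dplyr)

# Invented stand-in for the downloaded data
toy <- tibble(
  key = c("1", "2", "3"),
  scientificName = c("Castor fiber", "Lutra lutra", "Sciurus vulgaris"),
  year = c(2024L, 2023L, 2024L)
)

dim(toy)      # rows and columns together
nrow(toy)     # number of records
ncol(toy)     # number of fields
glimpse(toy)  # one line per column, with type and first values
```

With the real data, replace `toy` with `mammalsCZ`.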
Data are not ‘good’ or ‘bad’; their quality depends on our goal.
Some things we can check:
CoordinateCleaner: https://github.com/ropensci/CoordinateCleaner
Automated flagging of common spatial and temporal errors in data.
As an example of data cleaning procedures, we will check the following fields in our dataset:
basisOfRecord: we want preserved specimens or observations.
taxonRank: we want records at species level.
coordinateUncertaintyInMeters: we want it to be smaller than 10 km.
basisOfRecord: we want preserved specimens or observations
# A tibble: 7 × 2
# Groups: basisOfRecord [7]
basisOfRecord n
<chr> <int>
1 FOSSIL_SPECIMEN 200
2 HUMAN_OBSERVATION 6524
3 MATERIAL_CITATION 206
4 MATERIAL_SAMPLE 105
5 OBSERVATION 77
6 OCCURRENCE 11
7 PRESERVED_SPECIMEN 922
group_by() is used to group values within a variable
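A count table like the one above can be sketched as follows. The data are an invented stand-in for mammalsCZ, and the names `toy` and `counts` are illustrative.

```r
library(dplyr)

# Invented stand-in for the downloaded data
toy <- tibble(
  basisOfRecord = c("HUMAN_OBSERVATION", "HUMAN_OBSERVATION",
                    "PRESERVED_SPECIMEN", "FOSSIL_SPECIMEN")
)

# Group the rows by basisOfRecord, then count the rows in each group
counts <- toy %>%
  group_by(basisOfRecord) %>%
  count()

counts
```

The shortcut `count(toy, basisOfRecord)` gives the same numbers without keeping the grouping.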
basisOfRecord: we want preserved specimens or observations
Note the use of | (OR) to filter the data. Another alternative is filter(basisOfRecord %in% c("PRESERVED_SPECIMEN","HUMAN_OBSERVATION")).
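The two ways of writing this filter can be compared on a small invented table. The names `toy`, `with_or` and `with_in` are illustrative.

```r
library(dplyr)

# Invented stand-in for the downloaded data
toy <- tibble(
  species = c("sp1", "sp2", "sp3"),
  basisOfRecord = c("HUMAN_OBSERVATION", "FOSSIL_SPECIMEN", "PRESERVED_SPECIMEN")
)

# Option 1: combine two conditions with | (OR)
with_or <- toy %>%
  filter(basisOfRecord == "PRESERVED_SPECIMEN" | basisOfRecord == "HUMAN_OBSERVATION")

# Option 2: test membership in a vector of accepted values
with_in <- toy %>%
  filter(basisOfRecord %in% c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION"))

# Both keep the same rows; the fossil record is dropped
all.equal(with_or, with_in)
```

The `%in%` form tends to stay readable as the list of accepted values grows.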
taxonRank: we want records at species level
coordinateUncertaintyInMeters: we want them to be smaller than 10 km
mammalsCZ %>%
  filter(coordinateUncertaintyInMeters >= 10000) %>% # Records at or above the 10 km threshold, i.e. the ones we do not want
  select(scientificName,
         coordinateUncertaintyInMeters,
         stateProvince)
# A tibble: 309 × 3
scientificName coordinateUncertaint…¹ stateProvince
<chr> <dbl> <chr>
1 Bison bonasus (Linnaeus, 1758) 26389 Středočeský
2 Bison bonasus (Linnaeus, 1758) 26389 Středočeský
3 Lutra lutra (Linnaeus, 1758) 26614 Jihomoravský
4 Procyon lotor (Linnaeus, 1758) 26582 Jihočeský
5 Sciurus vulgaris Linnaeus, 1758 22379 Prague
6 Clethrionomys glareolus (Schreber, 1780) 26550 Jihočeský
7 Rhinolophus hipposideros (Bechstein, 18… 26454 Moravskoslez…
8 Lutra lutra (Linnaeus, 1758) 26614 Niederösterr…
9 Rhinolophus hipposideros (Bechstein, 18… 26454 Moravskoslez…
10 Myotis myotis (Borkhausen, 1797) 26454 Moravskoslez…
# ℹ 299 more rows
# ℹ abbreviated name: ¹coordinateUncertaintyInMeters
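The three checks can be combined in a single filter. The sketch below uses an invented table and makes one assumption explicit: records with a missing coordinateUncertaintyInMeters are dropped by the comparison unless they are kept on purpose. The names `toy`, `clean` and `clean_keep_na` are illustrative.

```r
library(dplyr)

# Invented stand-in for the downloaded data
toy <- tibble(
  species = c("sp1", "sp2", "sp3", "sp4", "sp5"),
  basisOfRecord = c("HUMAN_OBSERVATION", "FOSSIL_SPECIMEN", "PRESERVED_SPECIMEN",
                    "HUMAN_OBSERVATION", "HUMAN_OBSERVATION"),
  taxonRank = c("SPECIES", "SPECIES", "SPECIES", "GENUS", "SPECIES"),
  coordinateUncertaintyInMeters = c(102, 5, 26389, 10, NA)
)

accepted_basis <- c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION")

# Strict version: a missing uncertainty fails the comparison, so sp5 is dropped
clean <- toy %>%
  filter(basisOfRecord %in% accepted_basis,
         taxonRank == "SPECIES",
         coordinateUncertaintyInMeters < 10000)

# Lenient version: keep records whose uncertainty was not reported
clean_keep_na <- toy %>%
  filter(basisOfRecord %in% accepted_basis,
         taxonRank == "SPECIES",
         is.na(coordinateUncertaintyInMeters) | coordinateUncertaintyInMeters < 10000)
```

Which version is appropriate depends on the goal of the analysis, since many records come without a reported uncertainty.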
coordinateUncertaintyInMeters: we want them to be smaller than 10 km
How are the records distributed?
We’ll get to this next week :)
And finally, a simple trick to produce separate maps per order.
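The text does not say which trick is meant. One common option, assumed here, is faceting in ggplot2, which draws one panel per value of a column. The sketch uses a few invented points, and the names `toy`, `pts` and `p` are illustrative.

```r
library(ggplot2)
library(sf)

# Invented points, using the same column names as the GBIF download
toy <- data.frame(
  order = c("Rodentia", "Rodentia", "Carnivora", "Chiroptera"),
  decimalLongitude = c(14.4, 13.4, 16.6, 18.3),
  decimalLatitude = c(50.1, 49.8, 49.2, 49.8)
)

# Convert the coordinates to spatial points in WGS84 (EPSG:4326)
pts <- st_as_sf(toy, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)

# facet_wrap() creates a separate map panel for each taxonomic order
p <- ggplot(pts) +
  geom_sf() +
  facet_wrap(~order)
p
```

A country outline can be added as a first `geom_sf()` layer so that every panel shares the same basemap.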